In the last lesson, you saw some basic visualization code for exploring data. That lesson focused on the data, with visualization serving as a tool to explore it. In this lesson, we focus on visualization itself and cover some basic techniques for making your figures more appealing. We will also cover how to plot charts on maps.
In this section, we will cover how to configure various components of charts in ggplot2. We are going to use custdata again.
custdata <- read.table('custdata.tsv', header = TRUE, sep = '\t')
The chart title is set with the ggtitle() function.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.2
g1 = ggplot(data=custdata, aes(x=age,y=income)) + geom_point(color="blue") + ggtitle("Age vs Income")
g1
As you can see above, R gives default x and y labels based on the variable names. You can override these with the labs() function.
g2 = g1 + labs(x="Customer Age", y="Annual Income")
g2
Sometimes you may want to limit an axis to a specific range. There are three ways to achieve this.
#Method 1. use xlim. This is a shorthand for setting the limits of the x scale, so points outside the range are removed
g3 = g2 + xlim(c(0,100))
g3
## Warning: Removed 8 rows containing missing values (geom_point).
#Method 2. use scale_x_continuous. This method will remove all points outside the range
g4 = g2 + scale_x_continuous(limits = c(0,100))
g4
## Warning: Removed 8 rows containing missing values (geom_point).
#Method 3. use coord_cartesian. This method only adjusts the display area and does not remove any data
g5 = g2 + coord_cartesian(xlim=c(0,100))
g5
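The difference between scale limits and coordinate limits matters most when a layer computes a statistic. The following is a minimal sketch on a hypothetical toy data frame (toy, not custdata): with scale limits the regression line is fitted only to the rows inside the range, while with coord_cartesian it is fitted to all rows and the view is then cropped.

```r
library(ggplot2)

# Hypothetical toy data standing in for custdata: three ages fall outside 0-100
toy <- data.frame(age    = c(25, 40, 60, 85, 110, 123, 147),
                  income = c(20, 45, 60, 30, 15, 10, 5) * 1000)

base <- ggplot(toy, aes(x = age, y = income)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Scale limits: out-of-range ages are turned into NA before the line is fitted
p_scale <- base + scale_x_continuous(limits = c(0, 100))

# Coordinate limits: the line is fitted to all rows, then the view is cropped
p_coord <- base + coord_cartesian(xlim = c(0, 100))

# Number of rows that the scale-based version discards
n_dropped <- sum(toy$age < 0 | toy$age > 100)
n_dropped
```

Printing p_scale and p_coord one after the other should show two different fitted lines, even though both charts cover the same x range.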
In the last lesson, you saw how to add a dollar sign to income labels. A more advanced technique is to use a function to format the labels however you want. The following example shows how to do that.
g6<-ggplot(custdata) + geom_bar(aes(x=health.ins))
g7<-g6+scale_x_discrete(labels = function(x) ifelse(x, "Has Insurance", "Without Insurance"))
g7
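A label function is an ordinary R function, so it can be written and checked on its own before it is handed to a scale. The sketch below uses a hypothetical helper name, insurance_label, and converts its input with as.logical() so that it behaves the same whether it receives logical values or the strings "TRUE" and "FALSE".

```r
# Hypothetical helper: maps TRUE/FALSE (or "TRUE"/"FALSE") to readable labels
insurance_label <- function(x) {
  # as.logical() accepts both logical values and the strings "TRUE"/"FALSE"
  ifelse(as.logical(x), "Has Insurance", "Without Insurance")
}

# Try the function outside of a plot first
insurance_label(c(FALSE, TRUE))
insurance_label(c("FALSE", "TRUE"))
```

Once it behaves as expected, it can be passed to a scale in the same way as the anonymous function above, for example scale_x_discrete(labels = insurance_label).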
We may colour the data points on a chart by a categorical variable. For example, we may want to see whether gender makes a difference in the relationship between income and age. This can be achieved by:
g8 <- ggplot(data=custdata, aes(x=age,y=income,color=factor(sex))) + geom_point()
g8
You may also set the colours manually.
g8_1<-g8 + scale_color_manual(values=c("yellow","blue"))
g8_1
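With an unnamed vector, the colours are assigned to the factor levels in order. A named vector makes the pairing explicit. The sketch below assumes the levels are called "F" and "M" and uses a hypothetical toy data frame, so adjust the names to match your own data.

```r
library(ggplot2)

# Hypothetical toy data; the level names "F" and "M" are an assumption
toy <- data.frame(age    = c(23, 35, 47, 52, 61, 30),
                  income = c(21000, 40000, 52000, 48000, 39000, 33000),
                  sex    = factor(c("F", "M", "F", "M", "F", "M")))

# Naming each colour pins it to a level regardless of the level order
sex_colours <- c(M = "blue", F = "orange")

p_named <- ggplot(toy, aes(x = age, y = income, color = sex)) +
  geom_point() +
  scale_color_manual(values = sex_colours, name = "Sex")
p_named
```

The name argument sets the legend title, which would otherwise default to the expression used in aes().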
So far, we have covered the basic components of R charts. We can further configure the appearance of these charts with the theme() function, which allows us to modify the theme settings for every part of a chart. We will cover some of the most common scenarios in this section. You are encouraged to explore more on your own.
A title is essentially text, so configuring it comes down to passing the right arguments to element_text(). Below are some examples.
g9<-g1 + theme(title=element_text(size=20,face="bold",color="green"))
g9
As you can see, all the titles (the plot title as well as the axis titles) are affected by this setting. If we want to change only the plot title, we can use plot.title instead:
g10 <- g1 + theme(plot.title=element_text(size=20,face="bold",color="green"))
g10
We can also change the appearance of the tick labels.
g11 <- g7 + theme(axis.text.x=element_text(angle=50,size=10,vjust = 0.5))
g11
We can also change the background colour of a chart.
#this will change the background colour of the panel, i.e. the area where the data are drawn
g12 <- g1 + theme(panel.background = element_rect(fill = "yellow"))
g12
#this will change the background colour of the whole plot, i.e. the area surrounding the panel
g13 <- g1 + theme(plot.background = element_rect(fill = "yellow"))
g13
Grid lines can be configured using the panel.grid.* family of theme elements.
g14 <- g1 + theme(panel.grid.major = element_line(color = "yellow", size = 2), panel.grid.minor=element_line(color = "blue"))
g14
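The result of theme() is an object like any other, so several settings can be collected once and reused across charts. The following is a minimal sketch. The name lesson_theme, the toy data frame, and the specific values are illustrative assumptions rather than recommended settings.

```r
library(ggplot2)

# Collect the settings once; the name lesson_theme is hypothetical
lesson_theme <- theme(
  plot.title       = element_text(size = 16, face = "bold"),
  axis.text.x      = element_text(angle = 45, vjust = 0.5),
  panel.background = element_rect(fill = "grey95"),
  panel.grid.minor = element_blank()   # element_blank() removes an element
)

# Hypothetical toy data standing in for custdata
toy <- data.frame(age = c(25, 40, 60), income = c(20000, 45000, 60000))

p_themed <- ggplot(toy, aes(x = age, y = income)) +
  geom_point(color = "blue") +
  ggtitle("Age vs Income") +
  lesson_theme
p_themed
```

Adding the same lesson_theme object to every chart in a report keeps their appearance consistent.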
Sometimes it is more visually effective to split a chart into several panels for comparison.
g15 <- ggplot(data=custdata, aes(x=age,y=income)) + geom_point() + facet_wrap(~sex,ncol = 1)
g15
You may try other layout functions, such as facet_grid.
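As a sketch of how facet_grid differs, it lays the panels out in a grid defined by two variables, one for the rows and one for the columns. The toy data frame and its column names below are hypothetical stand-ins for custdata.

```r
library(ggplot2)

# Hypothetical toy data with two categorical variables
toy <- data.frame(
  age        = c(25, 40, 60, 33, 51, 70, 29, 45),
  income     = c(20000, 45000, 60000, 38000, 52000, 30000, 27000, 48000),
  sex        = c("F", "M", "F", "M", "F", "M", "F", "M"),
  health.ins = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

# Rows are defined by health.ins, columns by sex
p_grid <- ggplot(toy, aes(x = age, y = income)) +
  geom_point() +
  facet_grid(health.ins ~ sex)
p_grid
```

facet_wrap is usually the better choice for a single variable with many levels, while facet_grid suits the crossing of two variables.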
Please refer to “Beautiful plotting in R: A ggplot2 cheatsheet” for more information.
Exercise
Use the LearningANTS data to create well-designed visualizations.
There are other R packages that you may consider for data visualization. One that we recommend is lattice. In this section, we show some examples of how to use it. We are going to use “Lasagna Triers.csv”, which stores customer profile data on lasagna triers.
colclasses = c("integer","integer","numeric","numeric","factor","numeric","numeric","factor","factor","factor","integer","factor","factor")
triers <- read.csv("Lasagna Triers.csv",header = TRUE, colClasses = colclasses)
str(triers)
## 'data.frame': 856 obs. of 13 variables:
## $ Person : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 48 33 51 56 28 51 44 29 28 29 ...
## $ Weight : num 175 202 188 244 218 173 182 189 200 209 ...
## $ Income : num 65500 29100 32200 19000 81400 73000 66400 46200 61100 9800 ...
## $ PayType : Factor w/ 2 levels "Hourly","Salaried": 1 1 2 1 2 2 2 2 2 2 ...
## $ CarValue : num 2190 2110 5140 700 26620 ...
## $ CCDebt : num 3510 740 910 1620 600 950 3500 2860 3180 1270 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 1 1 2 2 1 ...
## $ LiveAlone: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ DwellType: Factor w/ 3 levels "Apt","Condo",..: 3 2 2 3 1 2 2 2 2 1 ...
## $ MallTrips: int 7 4 1 3 3 2 6 5 10 7 ...
## $ Nbhd : Factor w/ 3 levels "East","South",..: 1 1 1 3 3 1 3 3 3 1 ...
## $ HaveTried: Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 2 2 2 2 ...
lattice makes it easy to plot a histogram. For example,
library(lattice)
histogram(~Income, data=triers)
It also allows you to condition histograms on the value of a categorical variable. For example,
#Compare income between genders
histogram(~Income | Gender, data=triers)
#Compare income among neighborhoods. It is more effective to show the comparison in this layout
histogram(~Income | Nbhd, data=triers, layout=c(1,3))
Similar to histograms, density plots are easy to create.
#Compare car value between genders
densityplot(~CarValue | Gender, data = triers, layout=c(1,2), col="black")
This is an example using a dot plot.
dotplot(~CarValue | Nbhd, data = triers, layout=c(1,3))
With lattice, it is very convenient to create scatter plots conditioned on a third, categorical variable.
xyplot(Income~CarValue | Gender, data = triers, layout=c(1,2))
We can build conditional box plots using the bwplot function in lattice.
bwplot(Weight~factor(Gender) | factor(Nbhd), data = triers, xlab = "Gender")
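The formulas above all use | to draw one panel per level. As a sketch of an alternative, the groups argument keeps everything in a single panel and distinguishes the levels by colour instead. The toy data frame below is a hypothetical stand-in for triers.

```r
library(lattice)

# Hypothetical stand-in for the triers data, small enough to read at a glance
toy <- data.frame(
  Income   = c(30000, 52000, 41000, 75000, 28000, 64000),
  CarValue = c(2000, 9000, 5000, 21000, 1500, 15000),
  Gender   = factor(c("Female", "Male", "Female", "Male", "Female", "Male"))
)

# Conditioning: one panel per gender
p_panels <- xyplot(Income ~ CarValue | Gender, data = toy, layout = c(1, 2))

# Grouping: a single panel, genders distinguished by colour, with a legend
p_groups <- xyplot(Income ~ CarValue, groups = Gender, data = toy,
                   auto.key = TRUE)

print(p_panels)
print(p_groups)
```

Conditioning tends to be easier to read when the groups overlap heavily, while grouping makes direct comparison on a common scale easier.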
#Mean income for each combination of weight bin and car value bin (10 equal-width bins each)
t1 <- tapply(triers$Income, INDEX =list(cut(triers$Weight,breaks=10), cut(triers$CarValue,breaks=10)), FUN=mean,na.rm =TRUE)
t1
## (96.3,3.5e+03] (3.5e+03,6.88e+03] (6.88e+03,1.03e+04]
## (142,154] 34790.00 49030.00 63212.50
## (154,165] 31480.56 41759.26 53718.18
## (165,177] 34464.29 45681.25 60941.67
## (177,188] 28750.00 46593.94 56530.00
## (188,200] 31384.21 46917.39 73018.18
## (200,212] 31759.09 53064.52 45377.78
## (212,223] 30241.46 54520.00 78400.00
## (223,235] 29028.57 48550.00 45633.33
## (235,246] 39308.33 36484.62 92237.50
## (246,258] 23766.67 30475.00 36175.00
## (1.03e+04,1.36e+04] (1.36e+04,1.7e+04] (1.7e+04,2.04e+04]
## (142,154] 75433.33 44100.00 65600
## (154,165] 58083.33 65366.67 75300
## (165,177] 65255.56 67566.67 93000
## (177,188] 56062.50 52950.00 66600
## (188,200] 67383.33 60366.67 66600
## (200,212] 61520.00 56150.00 100600
## (212,223] 51916.67 64650.00 87750
## (223,235] 59200.00 41200.00 NA
## (235,246] 47500.00 63750.00 71800
## (246,258] 74500.00 NA 147700
## (2.04e+04,2.37e+04] (2.37e+04,2.71e+04] (2.71e+04,3.05e+04]
## (142,154] NA NA NA
## (154,165] NA 118900.00 92600
## (165,177] 67050 79350.00 NA
## (177,188] 56800 60083.33 NA
## (188,200] 54350 NA 78925
## (200,212] NA NA NA
## (212,223] 48500 82200.00 48000
## (223,235] NA NA NA
## (235,246] 69900 42400.00 NA
## (246,258] NA NA NA
## (3.05e+04,3.39e+04]
## (142,154] NA
## (154,165] NA
## (165,177] 90900
## (177,188] NA
## (188,200] NA
## (200,212] NA
## (212,223] NA
## (223,235] 44400
## (235,246] NA
## (246,258] NA
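#Show the table of mean incomes as a heat map
levelplot(t1)
#Rotate the x-axis labels so that the bin names do not overlap
levelplot(t1, scales=list(x=list(rot=90)))
#Mean income by gender and car value bin
t2 <- tapply(triers$Income, INDEX =list(triers$Gender, cut(triers$CarValue,breaks=10)), FUN=mean,na.rm =TRUE)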
levelplot(t2, scales=list(x=list(rot=90)))
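The level plots above rely on two steps: cut() turns a numeric variable into equal-width bins, and tapply() averages income within each combination of bins. The sketch below repeats those steps on a hypothetical six-row data frame, so that the resulting matrix can be checked by hand.

```r
library(lattice)

# Hypothetical toy data: each weight/car value combination holds one person
toy <- data.frame(weight = c(150, 155, 200, 205, 250, 255),
                  car    = c(1000, 9000, 1000, 9000, 1000, 9000),
                  income = c(30, 50, 40, 60, 20, 70))

# 3 weight bins (rows) by 2 car value bins (columns), mean income in each cell
bin_means <- tapply(toy$income,
                    INDEX = list(cut(toy$weight, breaks = 3),
                                 cut(toy$car, breaks = 2)),
                    FUN = mean, na.rm = TRUE)
bin_means

# Rows of the matrix go on the x axis, columns on the y axis
p_level <- levelplot(bin_means, scales = list(x = list(rot = 90)),
                     xlab = "Weight bin", ylab = "Car value bin")
print(p_level)
```

With real data, some combinations of bins contain no observations, which is why the t1 table above contains NA cells.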
Exercise
Use the three data sets in Chapter 2 of "Data Mining and Business Analytics with R" to do data visualization and develop insights.
“ggmap” is a package developed on top of ggplot2 for visualizing spatial data. It situates contextual information of various kinds of static maps in the ggplot2 plotting framework. The result is an easy, consistent way of specifying plots which are readily interpretable by both expert and audience and safeguarded from graphical inconsistencies by the layered grammar of graphics framework.
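One advantage of making the plots with ggplot2 is the layered grammar of graphics on which ggplot2 is based. By definition, the layered grammar demands that every plot consist of five components: a default dataset with aesthetic mappings; one or more layers, each with a geometric object (“geom”), a statistical transformation (“stat”), and a dataset with aesthetic mappings (possibly defaulted); a scale for each aesthetic mapping (which can be automatically generated); a coordinate system; and a facet specification.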
Since ggplot2 is an implementation of the layered grammar of graphics, every plot made with ggplot2 has each of the above elements. Consequently, ggmap plots also have these elements, but certain elements are fixed to map components: the x aesthetic is fixed to longitude, the y aesthetic is fixed to latitude, and the coordinate system is fixed to the Mercator projection.
A basic framework is to get the map and then overlay it with other ggplot2 charts. The following example illustrates the idea.
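#Note: recent versions of ggmap need a Google Maps API key, registered with register_google(), before qmap can download Google maps
library(ggmap)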
## Warning: package 'ggmap' was built under R version 3.3.2
#you can get lon and lat of a location by zip code
#pizzahut.location$Location <- paste("Singapore", pizzahut.location$Zipcode, sep = " ")
pizzahut.location <- read.csv("PizzaHut.csv",header = TRUE, colClasses = c("character","character","factor","character","numeric","numeric"))
#Define the map and the base_layer, which is equivalent to ggplot in previous sections
m1 <- qmap("Singapore", base_layer=ggplot(aes(x=lon, y = lat), data=pizzahut.location), zoom=11, scale=2)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Singapore&zoom=11&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Singapore&sensor=false
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead
#Now plot the points, and colour the points based on regions
m2<-m1 + geom_point(aes(color=Region))
m2
You can configure how the points are displayed just as you would in normal ggplot2 charts. For example, to size the points according to a third variable, you only need to change the geom_point call.
pizzahut.location$Visits = round(rnorm(nrow(pizzahut.location),15000,5000))
m1 <- qmap("Singapore", base_layer=ggplot(aes(x=lon, y = lat), data=pizzahut.location), zoom=11, scale=2)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Singapore&zoom=11&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Singapore&sensor=false
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead
m3 <- m1 + geom_point(aes(color=Region, size=Visits))
m3
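Because the overlay is an ordinary ggplot2 layer, it can be prototyped without downloading a map. The sketch below uses hypothetical outlet coordinates and region names, not the contents of PizzaHut.csv. It draws the same geom_point layer on a plain ggplot2 canvas, with coord_quickmap() to keep the longitude and latitude proportions roughly right, and it calls set.seed() so that the simulated visit counts are the same on every run.

```r
library(ggplot2)

set.seed(42)  # fix the simulated visit counts so the chart is reproducible

# Hypothetical outlets; the coordinates are rough points within Singapore
outlets <- data.frame(
  lon    = c(103.82, 103.85, 103.94, 103.70),
  lat    = c(1.30, 1.35, 1.35, 1.34),
  Region = factor(c("Central", "Central", "East", "West"))
)
outlets$Visits <- round(rnorm(nrow(outlets), 15000, 5000))

# Same layer as on the map, but drawn without map tiles
p_offline <- ggplot(outlets, aes(x = lon, y = lat)) +
  geom_point(aes(color = Region, size = Visits)) +
  coord_quickmap()
p_offline
```

Once the layer looks right, the same geom_point call can be added to a qmap object as in the example above.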
You can further overlay other chart types. For example,
m4 <- m1 + geom_point(aes(color=Region)) + geom_path()
m4
m5 <- m1 + stat_bin2d(aes(color=Region,fill=Region))
m5
Please refer to “ggmap: Spatial Visualization with ggplot2” for more details.
ggmap can also draw on maps from other sources and render them in various map types.
Sources of maps:
Map types:
Let’s try a few combinations here:
pizzahut.location$Visits = round(rnorm(nrow(pizzahut.location),15000,5000))
m1 <- qmap("Singapore", maptype="satellite", base_layer=ggplot(aes(x=lon, y = lat), data=pizzahut.location), zoom=11, scale=2)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Singapore&zoom=11&size=640x640&scale=2&maptype=satellite&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Singapore&sensor=false
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead
m6 <- m1 + geom_point(aes(color=Region))
m6
Plotting charts on a map is useful for many kinds of spatial analytics. However, a more appealing visualization is to shade the areas of a map according to different attributes. The packages “raster” and “rgdal” can be used for this purpose.
“CO2Emission.R” provides an example.
Exercise
Obtain spatial data and other government data from data.gov.sg and develop a visualization on a map of Singapore.